ctrfandomcom-20200214-history
Ring counts in a SMILES file
The goals of this task are to show how to read molecules from a SMILES file and how to get the number of cycles in a molecule. As is usual in chemistry software, the data structure called a "molecule" may contain more than one chemistry molecules, which in a graph theory terms is called a connected component. Ring counts can be used as descriptors and as ways to classify structures. One way to compute the ring count is to compute the Euler characteristic: #Rings = #Bonds - #Atoms + #Components. Many chemistry toolkits, though not all, implement a ring finding algorithm which identifies the most chemically relevant rings. This is not a simple problem: consider that cubane has 12 bonds and 8 atoms, giving 5 rings, which implies that one of the faces is not in a ring. There is a large number of ring finding algorithms. Many have the goal of finding a "smallest set of smallest rings", so are called an SSSR algorithm. Frequently the easiest way to get the ring count from a toolkit comes as a direct consequence of computing the SSSR. Note though that ring identification is not necessary for this task. Implementation Use the SMILES file from the benzodiazepine data set, available from http://dalkescientific.com/writings/benzodiazepine.smi.gz. For each input record, find the number of rings. This would be the minimal number of rings and not the total number of all possible rings. Many of the 12386 records contain multiple components, most often counter-ions but some are more complex. Please report the total ring count for all components in the SMILES, even though that does not make much pharmacological sense. The output will have one integer count per line, in the same order as the input file. You should check your results against the reference output. OpenBabel/Pybel import pybel for mol in pybel.readfile("smi", "benzodiazepine.smi"): print len(mol.OBMol.GetSSSR()) OpenBabel/Rubabel require 'rubabel' Rubabel.foreach("benzodiazepine.sdf.gz") {|mol| puts mol.ob_sssr.size } OpenEye/Python OpenEye is known for it's stance against SSSR (See SSSR Considered Harmful.) They do not provide a function to determine this property directly. Instead, use Euler's characteristic to compute it given the number of atoms, bonds, and components: #Rings = #Bonds - #Atoms + #Components . from openeye.oechem import * ifs = oemolistream() ifs.open("benzodiazepine.smi") for mol in ifs.GetOEGraphMols(): num_components, component_membership = OEDetermineComponents(mol) num_rings = mol.NumBonds() - mol.NumAtoms() + num_components print num_rings RDKit/Python As is pointed out so clearly by the OpenEye rant linked above, a true SSSR algorithm returns an arbitrary selection of rings in certain fused ring cases (e.g. adamantane and cubane). This makes little chemical sense and is confusing. Still, it is often useful to have the set of smallest rings in the system. The RDKit, by default, generates a symmetrized version of SSSR; this leads to cubane having the expected 6 rings and a SMARTS match for "R3" returning all 8 atoms. For the purposes of this exercise, the code below shows the use of Euler's rule as well as outputting the SMILES of the molecules where there's a mismatch between symmetrized and standard SSSR algorithms: from rdkit import Chem suppl = Chem.SmilesMolSupplier('benzodiazepine.smi',titleLine=False) counts = [] for m in suppl: rc = m.GetRingInfo().NumRings() nfrags = len(Chem.GetMolFrags(m)) ec = m.GetNumBonds()-m.GetNumAtoms()+nfrags if ec!=rc: print Chem.MolToSmiles(m),ec,rc counts.append((ec,rc,m.GetProp('_Name'))) Chemkit/C++ #include #include #include #include using namespace chemkit; int main(void) { MoleculeFile file("benzodiazepine.smi.gz"); bool ok = file.read(); if(!ok){ std::cerr << "Error: " << file.errorString() << std::endl; return -1; } foreach(const Molecule *molecule, file.molecules()){ std::cout << molecule->ringCount() << std::endl; } return 0; } Cactvs/Tcl molfile loop benzodiazepine.smi eh { puts get $eh E_NSSSR } For the ESSSR count (contains rings covering all 3-ringatom-chain, e.g. 6 rings in cubane) substitute property E_NESSSR. Cactvs/Python for eh in Molfile('benzodiazepine.smi'): print(eh.E_NSSSR) For the ESSSR count (contains rings covering all 3-ringatom-chain, e.g. 6 rings in cubane) substitute property E_NESSSR. CDK/Groovy import org.openscience.cdk.interfaces.*; import org.openscience.cdk.graph.*; import org.openscience.cdk.io.*; import org.openscience.cdk.io.IChemObjectReader.Mode; import org.openscience.cdk.io.iterator.*; import org.openscience.cdk.ringsearch.*; import org.openscience.cdk.silent.*; import org.openscience.cdk.tools.manipulator.*; import java.io.File; import java.util.zip.GZIPInputStream; iterator = new IteratingSMILESReader( new GZIPInputStream( new File("ctr/benzodiazepine.smi.gz") .newInputStream() ), SilentChemObjectBuilder.getInstance() ) iterator.setReaderMode( IChemObjectReader.Mode.STRICT ) while (iterator.hasNext()) { mol = iterator.next() components = ConnectivityChecker.partitionIntoMolecules( mol ) totalRingCount = 0 for (component in components.atomContainers()) { ringset = new SSSRFinder(component).findSSSR() totalRingCount += ringset.atomContainerCount } println totalRingCount } Category:OpenBabel/Pybel Category:SMILES Category:Ring count Category:OpenEye/Python Category:RDKit/Python Category:Cactvs/Tcl Category:CDK/Groovy Category:Cactvs/Python